DATA

We first read the following data, implementing the code below:

## read data and convert candidate names and party names from string to factor
election.raw <- read_csv("candidates_county.csv", col_names = TRUE) %>% 
  mutate(candidate = as.factor(candidate), party = as.factor(party))

## remove the word "County" from the county names
words.to.remove = c("County")
remove.words <- function(str, words.to.remove){
  sapply(str, function(str){
    x <- unlist(strsplit(str, " "))
    x <- x[!x %in% words.to.remove]
    return(paste(x, collapse = " "))
  }, simplify = "array", USE.NAMES = FALSE)
}
election.raw$county <- remove.words(election.raw$county, words.to.remove)

## read census data
census <- read_csv("census_county.csv") 

Election data

1. Report the dimension of election.raw. Are there missing values in the data set? Compute the total number of distinct values in state in election.raw to verify that the data contains all states and a federal district.

#dimensions of election.raw
dim(election.raw)
## [1] 31167     5
#any missing values in election.raw?
sum(is.na(election.raw))
## [1] 0
#total number of distinct values in state
length(unique(election.raw$state))
## [1] 51

The dimension of election.raw is 31167 rows and 5 columns. There are no missing values in election.raw dataset. There are 51 distinct values in the state column of election.raw, which accounts for the 50 states and The Federal District of Columbia (i.e. Washington DC).

Census data

2. Report the dimension of census. Are there missing values in the data set? Compute the total number of distinct values in county in census. Compare the values of total number of distinct county in census with that in election.raw. Comment on your findings.

#dimension of census
dim(census)
## [1] 3220   37
#any missing values in census?
sum(is.na(census))
## [1] 1
#total number of distinct values in county of census
length(unique(census$County))
## [1] 1955
#compare to total number of distinct values in county of election.raw
length(unique(election.raw$county))
## [1] 2825

The dimension of census is 3220 rows and 37 coulmns. There is 1 missing value in the census dataset. The total number of distinct values for the county column in the census data is 1955, while in the election.raw data there are 2825 distinct values for the county column. So, we can see that the census data seems to have 870 less distinct values for county than the election.raw data.

Data wrangling

3. Construct aggregated data sets from election.raw data: i.e.,

- Keep the county-level data as it is in election.raw.

- Create a state-level summary into a election.state.

- Create a federal-level summary into a election.total.

#create state-level summary for election.state
election.state = election.raw %>%
  dplyr::group_by(state, candidate) %>%
  dplyr::summarise(votes = sum(votes))

head(election.state)
#create federal-level summary for election.total
election.total = election.raw %>%
  dplyr::group_by(candidate) %>%
  dplyr::summarise(votes = sum(votes))

head(election.total)

4. How many named presidential candidates were there in the 2020 election? Draw a bar chart of all votes received by each candidate. You can split this into multiple plots or may prefer to plot the results on a log scale. Either way, the results should be clear and legible! (For fun: spot Kanye West among the presidential candidates!)

#number of presidential candidates in 2020 election
length(election.total$candidate)
## [1] 38
#bar chart of votes per candidate 
subdat <- election.total[1:10,]
subdat1<- election.total[11:20,]
subdat2<- election.total[21:30,]
subdat3<- election.total[31:38,]

ggplot(data=subdat, aes(x=candidate, y=log(votes))) + geom_col(fill=rainbow(10)) + geom_text(aes(label=votes), hjust=-0.2, size=3)  + coord_flip() + labs(title="National Votes per Candidate; part 1", y="log(national votes/candidate)") + scale_y_continuous(limit=c(0,22))

ggplot(data=subdat1, aes(x=candidate, y=log(votes))) + geom_col(fill=rainbow(10)) + geom_text(aes(label=votes), hjust=-0.2, size=2.5)  + coord_flip() + labs(title="National Votes per Candidate; part 2", y="log(national votes/candidate)") + scale_y_continuous(limit=c(0,22))

ggplot(data=subdat2, aes(x=candidate, y=log(votes))) + geom_col(fill=rainbow(10)) + geom_text(aes(label=votes), hjust=-0.2, size=3)  + coord_flip() + labs(title="National Votes per Candidate; part 3", y="log(national votes/candidate)") + scale_y_continuous(limit=c(0,22))

ggplot(data=subdat3, aes(x=candidate, y=log(votes))) + geom_col(fill=rainbow(8)) + geom_text(aes(label=votes), hjust=-0.2, size=3)  + coord_flip() + labs(title="National Votes per Candidate; part 4", y="log(national votes/candidate)") + scale_y_continuous(limit=c(0,22))

We can see that there were 38 named presidential candidates in the 2020 election. The data is split into 4 horizontal bar charts on a log scale to ease visualization, with the raw number of national votes each candidate got expressed to right of their respective bar. We can clearly see that Joe Biden and Donald Trump received the most votes nationally, and that Kanye West received 64379.

**5. Create data sets county.winner and state.winner by taking the candidate with the highest proportion of votes in both county level and state level.

We create county.winner by taking the election.raw dataset and grouping by county and state. We then compute total_votes as the sum of the votes column of each county, and pct = votes/total_votes as the proportion of votes. We then choose the highest row using top_n to get the candidate with the highest proportion of votes at the county level.

#create county.winner
county.winner = election.raw %>%
  dplyr::group_by(county, state) %>%
  dplyr::mutate(total_votes = sum(votes), pct=votes/total_votes) %>%
  top_n(1)

head(county.winner)

We then create state.winner by taking the election.raw dataset and grouping by just state. Similarly to county.winner, we compute total_votes as the sum of the votes column of each state, and pct = votes/total_votes as the proportion of votes. We then choose the highest row using top_n to get the candidate with the highest proportion of votes at the state level.

#create state.winner
state.winner = election.state %>%
  dplyr::group_by(state) %>%
  dplyr::mutate(total_votes = sum(votes), pct=votes/total_votes) %>%
  top_n(1)
## Selecting by pct
head(state.winner)

Visualization

library(ggplot2)
library(maps)
states <- map_data("state")
head(states)
ggplot(data = states) + 
  geom_polygon(aes(x = long, y = lat, fill = region, group = group),
               color = "white") + 
  coord_fixed(1.3) +
  labs(title="State-Level Map", y="Latitude", x="Longitude") + 
  guides(fill=FALSE)  # color legend is unnecessary and takes too long

6. Use similar code to above to draw county-level map by creating counties = map_data(“county”). Color by county.

counties= map_data("county")
head(counties)
ggplot(data = counties) + 
  geom_polygon(aes(x = long, y = lat, fill = region, group = group),
               color = "white") + 
  coord_fixed(1.3) +
  labs(title="County-Level Map", y="Latitude", x="Longitude") + 
  guides(fill=FALSE)

7. Now color the map by the winning candidate for each state. First, combine states variable and state.winner we created earlier using left_join(). Note that left_join() needs to match up values of states to join the tables. A call to left_join() takes all the values from the first table and looks for matches in the second table. If it finds a match, it adds the data from the second table; if not, it adds missing values:

states <- map_data("state")

#make state column of state.winner all lowercase
state.winner.lower = state.winner %>% mutate(state = tolower(state))

#left join states and state.winner.lower
states=left_join(states, state.winner.lower, by = c("region" = "state"))

ggplot(data = states) + 
  geom_polygon(aes(x = long, y = lat, fill = candidate, group = region),colour="white" ) + 
  coord_fixed(1.3) +
  guides(fill=FALSE) + 
  ggtitle("Winning Candidate by State") + labs(y = "Latitude", x = "Longitude")

8. Color the map of the state of California by the winning candidate for each county. Note that some county have not finished counting the votes, and thus do not have a winner. Leave these counties uncolored.

To get a map of the winning candidate for each county in the state of California, we subset county.winner and counties to only include California data. We then make all of the state and county names lowercase in county.winner so that we can left join this data with counties.

counties= map_data("county")

#subset California county.winner data, make county & state columns all lowercase
cal.county.winner = county.winner %>%
  subset(state == "California") %>%
  mutate(county = tolower(county), state=tolower(state))

#subset California counties data
cal.counties = counties %>% subset(region=="california")

#left join cal.counties and cal.county.winner
cal.countyjoined=left_join(cal.counties,cal.county.winner,by=c("subregion"="county"))

ggplot(data = cal.countyjoined) + 
  geom_polygon(aes(x = long, y = lat, fill = candidate, group = group),colour="white" ) + 
  coord_fixed(1.3) +
  guides(fill=FALSE) +
  labs(title="Winning Candidate by CA County", y="Latitude", x="Longitude")

We must now check to see if there are discrepancies in the county names of the datasets, to double check that our map is correct.

#check for county name discrepancies between datasets
all(
  sort(unique(map_data("county")[map_data("county")$region == "California", ]$subregion)) ==
  sort(tolower(unique(election.raw[election.raw$state == "California", ]$county)))
)
## [1] TRUE
setdiff(cal.county.winner$county, cal.counties$subregion)
## character(0)

Our data does not show any discrepancies in county names between the counties and election.raw data, so we are obtaining election winner information for every California county.

9. (Open-ended) Create a visualization of your choice using census data. Many exit polls noted that demographics played a big role in the election. Use this Washington Post article and this R graph gallery for ideas and inspiration.

We decided to create 2 interactive heat maps of the minority population percentages of states in the US, with the first map outlining the populations of states won by Donald Trump, and the second with the states won by Joe Biden in the 2020 presidential election.

library(tidyverse)
library(hrbrthemes)
library(viridis)
library(plotly)
library(heatmaply)

#demographic heatmap
census.vis <- census %>%
  dplyr::group_by(State) %>%
  dplyr::summarise(White = sum((White/100)*TotalPop), 
                   Hispanic = sum((Hispanic/100)*TotalPop), 
                   Black = sum((Black/100)*TotalPop), 
                   Native = sum((Native/100)*TotalPop), 
                   Asian = sum((Asian/100)*TotalPop), 
                   Pacific = sum((Pacific/100)*TotalPop),
                MinorityTotal = Hispanic + Black + Native + Asian + Pacific,
                HispanicMinPct = (Hispanic/MinorityTotal)*100, 
                BlackMinPct = (Black/MinorityTotal)*100, 
                NativeMinPct = (Native/MinorityTotal)*100, 
                AsianMinPct = (Asian/MinorityTotal)*100, 
                PacificMinPct = (Pacific/MinorityTotal)*100)

#how many states has higher white population than minority population?
sum(census.vis$White > census.vis$MinorityTotal)
## [1] 46

We can see that 46/51 states (since District of Columbia is included) have a higher white population than minority population, so we opted to display a breakdown of the minority population as opposed to the complete demographic population. Since the demographic information in the census data is broken down into percentages, we first grouped the data by “State” and converted each demographic percentage value to a raw population number by dividing the percentage by 100 and multiplying each demographic percentage by “TotalPop”. We then created a “MinorityTotal” column, adding up all of the minority population numbers per state. Following this, we found the population percentage that each minority demographic makes up of the total minority demographic in the accompanying “MinPct” columns by taking each demographic population total and dividing it by “MinorityTotal”, and saved all of this information in the dataframe “census.vis”. To create our respective heatmaps, we filtered this dataframe to for either states that went to Trump (census.trump) or states that went to Biden (census.biden), deleted all variables that were not minority percentages, and converted the new dataframes into matrices with the states as row names.

#minority demographic breakdowns in Trump states vs Biden states
#Trump states
trump_states = state.winner %>% 
  subset(candidate == "Donald Trump") %>%
  select(state)

census.trump <- census.vis %>% 
  subset(State %in% trump_states$state) %>%
  dplyr::select(-State, -White, -Hispanic, -Black, -Native, -Asian, -Pacific, -MinorityTotal)
census.trump <- as.matrix(census.trump)
rownames(census.trump) = trump_states$state

trumpminpct.heat <- heatmaply(census.trump, 
        dendrogram = "none",
        xlab = "Demographic", ylab = "State", 
        main = "Minority Pop. Percentage per Trump State",
        grid_color = "white",
        branches_lwd = 0.1,
        label_names = c("State", "Demographic:", "Pop. Percentage"),
        fontsize_row = 5, fontsize_col = 8,
        labCol = c("Hispanic", "Black", "Native", "Asian", "Pacific"),
        labRow = rownames(census.trump))
trumpminpct.heat
#Biden states
biden_states = state.winner %>% 
  subset(candidate == "Joe Biden") %>%
  select(state)

census.biden <- census.vis %>% 
  subset(State %in% biden_states$state) %>%
  dplyr::select(-State, -White, -Hispanic, -Black, -Native, -Asian, -Pacific, -MinorityTotal)
census.biden <- as.matrix(census.biden)
rownames(census.biden) = biden_states$state

bidenminpct.heat <- heatmaply(census.biden, 
        dendrogram = "none",
        xlab = "Demographic", ylab = "State", 
        main = "Minority Pop. Percentage per Biden State",
        grid_color = "white",
        branches_lwd = 0.1,
        label_names = c("State", "Demographic:", "Pop. Percentage"),
        fontsize_row = 5, fontsize_col = 8,
        labCol = c("Hispanic", "Black", "Native", "Asian", "Pacific"),
        labRow = rownames(census.biden))
bidenminpct.heat

Each of the heatmaps are interactive, with the state, demographic, and minority population percentage appearing upon hovering over any square. We can draw many conclusions from comparing these 2 heatmaps. We can see that Biden won many more majority Hispanic states than Trump, while Trump won more majority Black states than Biden. We can also see that Trump seemed to have won many states with a large Native population, while Biden was favored in many states with a larger Asian population.

10. The census data contains county-level census information. In this problem, we clean and aggregate the information as follows.

Clean county-level census data census.clean: start with census, filter out any rows with missing values, convert {Men, Employed, VotingAgeCitizen} attributes to percentages, compute Minority attribute by combining {Hispanic, Black, Native, Asian, Pacific},remove these variables after creating Minority, remove {IncomeErr, IncomePerCap, IncomePerCapErr, Walk, PublicWork, Construction}. Many columns are perfectly colineared, in which case one column should be deleted.

Print the first 5 rows of census.clean:

census.clean<- census

#filter rows with missing values
census.clean <- census.clean[complete.cases(census.clean),]

#convert attributes
census.clean <- census.clean %>% 
  mutate(Men = 100*(Men/TotalPop), 
          Employed = 100*(Employed/TotalPop),
          VotingAgeCitizen = 100*(VotingAgeCitizen/TotalPop),
          Minority = (Hispanic + Black + Native + Asian + Pacific)) %>% 
          select(-(Black:Pacific), -Hispanic, -IncomeErr,-IncomePerCap, -IncomePerCapErr, -Walk, -PublicWork, -Construction, -Women, -White)

head(census.clean)

To reduce perfect multicollinearity, we decided to take out Walk, PublicWork, Construction, and Women from the dataset. The columns “Drive”, “Carpool”, “Transit”, “Walk”, “OtherTransp”, and “WorkAtHome” are all percentages related to means of transportation that people take to work, so subtracting the sum of all of these values from 100 for a certain row would give the value for “Walk”. The column “PublicWork” exhibits a similar pattern with “PrivateWork”, “SelfEmployed”, and “FamilyWork” in relation to type of work a person has, so that is removed as well. “Construction” also has a similar pattern with “Professional”, “Service”, “Office”, and “Production” in terms of genre of work, so it is removed. Finally, “Women” is also removed since it is essentially the difference between “TotalPop” and “Men”, and so its percentage value can be solved by subtracting the “Men” percentage from 100. We can also take out “White” since it is just the difference of 100-“Minority”.

Dimensionality reduction

11. Run PCA for the cleaned county level census data (with State and County excluded). Save the first two principle components PC1 and PC2 into a two-column data frame, call it pc.county. Discuss whether you chose to center and scale the features before running PCA and the reasons for your choice. What are the three features with the largest absolute values of the first principal component? Which features have opposite signs and what does that mean about the correlation between these features?

We additionally remove the “CountyId” column from the dataset when running the PCA, as it does not provide any information about facets of the county populations, and just serves as a unique number identifier for the County. In this way, it is identical to “County”, so is not necessary.

pr.out=prcomp(subset(census.clean, select = -c(State, County, CountyId)), center=TRUE, scale=TRUE)

#first two principle components
pc.county = pr.out$x[,1:2]

#features of first principal component
pr.out$rotation[,1]
##         TotalPop              Men VotingAgeCitizen           Income 
##      0.058440784      0.022642585     -0.018605804      0.354108024 
##          Poverty     ChildPoverty     Professional          Service 
##     -0.389634369     -0.392757998      0.269920786     -0.204870027 
##           Office       Production            Drive          Carpool 
##     -0.065368726     -0.120212424     -0.137565113     -0.062823634 
##          Transit      OtherTransp       WorkAtHome      MeanCommute 
##      0.071333963     -0.005265194      0.225134456     -0.084240223 
##         Employed      PrivateWork     SelfEmployed       FamilyWork 
##      0.372620946      0.051964548      0.130577865      0.069151428 
##     Unemployment         Minority 
##     -0.339531128     -0.257442698
#largest absolute loadings of first principal component
head(sort(abs(pr.out$rotation[,1]), decreasing=TRUE), 3)
## ChildPoverty      Poverty     Employed 
##    0.3927580    0.3896344    0.3726209

We decided to scale our data since all of the columns have so many different units. The majority of the columns are in percentage form, but columns like “TotalPop” and “Income” just show raw amounts. Scaling would help standardize the data before the analysis. We also decided to center the data as the census.clean dataset contains many continuous categorical variables and since many of our predictor variables interact with each other, even though we have cleaned the dataset for collinearity. Variables like “FamilyWork”, “PrivateWork”, and “SelfEmployed” are related in that they (plus “PublicWork” which has been removed) all add up to 100 to account for the type of work the population is doing, and so they interact with each other. Due to instances like this, scaling and centering were set to TRUE.

We can see that “ChildPoverty”, “Poverty”, and “Employed” have the highest absolute loadings of 0.3927580, 0.3896344, and 0.3726209 respectively. We know that the loadings are essentially the coefficients of the linear combination of all of the features, so we can say that these three variables would have the greatest effect on the first principle component and contribute most to it. However without the absolute value, we can see that “ChildPoverty” and “Poverty” have negative PC1 loadings while “Employed” is positive. As we know with any linear combination, the response decreases as the value of the variables with the negative coefficients increase, and the response increases as the value of the variables with the positive coefficients increase. So as the values of “ChildPoverty” and “Poverty” increase PC1 would decrease, and as “Employed” increases PC1 would as well. Since our data is standardized, we can say that since these three variables have the largest absolute loadings, they are most correlated with the first principal component.

We can see that there are many loadings that are positive and many that are negative, and can try to draw patterns from this. We know that “Professional”, “Service”, “Office”, and “Production” are related in census.clean as additive percentage values representing genre of work, and can see that “Professional” has a positive PC1 loading while “Service”, “Office”, and “Production” all have negative PC1 loadings. So, we can infer that as “Professional” decreases, the other three work genres would increase. We can also see that “Employment” and “Unemployment” have opposite signs, so know that there is a negative correlation between these as well.

12. Determine the minimum number of PCs needed to capture 90% of the variance for the analysis. Plot proportion of variance explained (PVE) and cumulative PVE.

#PVE
pr.var=(pr.out$sdev)^2
pve=pr.var/sum(pr.var)

#plot PVE
plot(pve, type="b", xlab="Principal Component", ylab="PVE", ylim=c(0,1), main = "PVE per Principal Component")

#cumulative PVE
cumulative_pve <- cumsum(pve)

#plot cumulative PVE
plot(cumulative_pve, type="b", xlab="Principal Component ", ylab=" Cumulative PVE ", ylim=c(0,1), main = "Cumulative PVE per Principal Component", col = "blue")

#how many PC's to capture 90% of variance?
which(cumulative_pve >= 0.9)[1]
## [1] 13

From plotting the PVE and cumulative PVE, we can see that the minimum number of PC’s needed to capture 90% of the variance of the analysis is 13.

Clustering

13. With census.clean (with State and County excluded), perform hierarchical clustering with complete linkage. Cut the tree to partition the observations into 10 clusters. Re-run the hierarchical clustering algorithm using the first 2 principal components from pc.county as inputs instead of the original features. Compare the results and comment on your observations. For both approaches investigate the cluster that contains Santa Barbara County. Which approach seemed to put Santa Barbara County in a more appropriate clusters? Comment on what you observe and discuss possible explanations for these observations.

We first perform hierarchical clustering with complete linkage of the scaled census.clean whole dataset before the cutting the tree into 10 clusters.

set.seed(123)
#scale and center data
census.clean.scale = scale(census.clean[, -c(1,2,3)], center=TRUE, scale=TRUE)

#hierarchical clustering with all of census.clean
census.dist = dist(census.clean.scale)
census.hclust = hclust(census.dist, method="complete")
clus = cutree(census.hclust, 10)
table(clus)
## clus
##    1    2    3    4    5    6    7    8    9   10 
## 2874  172    6   17   75   35    1    6   29    4
#which cluster is Santa Barbara County in?
sbid = which(census.clean$County == "Santa Barbara County")
clus[sbid]
## [1] 1

We can observe the relative sizes of the clusters, and note that “Santa Barbara County” is in cluster 1, the largest cluster with 2874 counties.

We now perform hierarchical clustering with complete linkage on the first 2 principal components from pc.county, obtained in #11.

set.seed(123)
#hierarchical clustering w/ pc.county
pc.county.dist = dist(pc.county)

set.seed(123)
pc.county.hclust = hclust(pc.county.dist, method="complete")
clus3 = cutree(pc.county.hclust, 10)
table(clus3)
## clus3
##    1    2    3    4    5    6    7    8    9   10 
## 2087  339  394  102   77  139   38   29    7    7
#which cluster is Santa Barbara County in?
sbclus = pc.county.hclust$order[sbid]
clus3[sbclus]
## [1] 3

We get much more evenly distributed clusters from this method. While the first cluster is still the largest by a longshot, holding 2087 counties, there are 4 other clusters with a few hundred counties, unlike our first clustering attempt which just had cluster 2 with 172 counties. We can also see that Santa Barbara County is in cluster 3 with the PCA clustering method, this cluster containing 394 counties. Between these two methods, it seems that the PCA clustering method has put Santa Barbara County in a more appropriate cluster. We are working with a large dataset with many continuous variables, so performing a a principal component analysis before the hierarchical clustering analysis reduces the dimension of the data and noise contained, allowing for more efficient clustering. This is seen in our results as the clusters from using pc.county are much more evenly distributed.

Since both of these datasets are so big, plotting dendrograms to visualize Santa Barbara Sounty’s cluster is very difficult, so we have additionally run hierarchical clustering on all of the California census data and plotting the accompanying dendrogram.

We additionally once again remove the “CountyId” column from the dataset when running the clustering, as it does not provide any information about facets of the county populations, and just serves as a unique number identifier for the County. In this way, it is identical to “County”, so is not necessary.

set.seed(123)
#hierarchical clustering with California data

#find which indices of census.clean are California data
caindex = which(census.clean$State == "California")

#get subset of census.clean that will include SB County
census.subset = census.clean[caindex,] %>%
  mutate(County = gsub(" County|  Parish", "", County))

#hierarchical clustering with census.subset
#scale data b/c otherwise dendrogram is rlly big
census.dist = dist(scale(subset(census.subset, select = -c(State, County, CountyId))))

census.hclust = hclust(census.dist, method="complete")
clus2 = cutree(census.hclust,10)
table(clus2)
## clus2
##  1  2  3  4  5  6  7  8  9 10 
## 13  4 13 15  2  5  1  1  2  2
#plot dendrogram
dend1 = as.dendrogram(census.hclust)
# color branches and labels by 10 clusters
dend1 = color_branches(dend1, k=10)
dend1 = color_labels(dend1, k=10)
# labels
dend1 = set(dend1, "labels_cex", 0.5)
dend1 = set_labels(dend1, labels=census.subset$County[order.dendrogram(dend1)])

plot(dend1, horiz=F, main = "Dendrogram colored by ten clusters")
abline(h=12, col="green", lty=2)
abline(h = 4, col = "red", lty=2)

#which cluster is Santa Barbara County in?
sbclus = which(census.subset$County == "Santa Barbara")
clus2[sbclus]
## [1] 6

We have identified Santa Barbara County to be in cluster 6, and can see this depicted in our dendrogram, with the 6th cluster in purple. We can see that Santa Barbara County is clustered with Inyo, Monterey, Santa Cruz, and Yolo counties.

Classification

We start considering supervised learning tasks now. The most interesting/important question to ask is: can we use census information in a county to predict the winner in that county?

In order to build classification models, we first need to combine county.winner and census.clean data. This seemingly straightforward task is harder than it sounds. For simplicity, the following code makes necessary changes to merge them into election.cl for classification.

# we move all state and county names into lower-case
tmpwinner <- county.winner %>% ungroup %>%
  mutate_at(vars(state, county), tolower)

# we move all state and county names into lower-case
# we further remove suffixes of "county" and "parish"
tmpcensus <- census.clean %>% mutate_at(vars(State, County), tolower) %>%
  mutate(County = gsub(" county|  parish", "", County)) 

# we join the two datasets
election.cl <- tmpwinner %>%
  left_join(tmpcensus, by = c("state"="State", "county"="County")) %>% 
  na.omit

# drop levels of county winners if you haven't done so in previous parts
election.cl$candidate <- droplevels(election.cl$candidate)

## save meta information
election.meta <- election.cl %>% select(c(county, party, CountyId, state, votes, pct, total_votes))

## save predictors and class labels
election.cl = election.cl %>% select(-c(county, party, CountyId, state, votes, pct, total_votes))

14. Understand the code above. Why do we need to exclude the predictor party from election.cl?

We need to exclude the predictor party from election.cl as since the only two candidates present in election.cl are Donald Trump and Joe Biden, their party preference goes without saying. We are excluding any irrelevant columns from the dataset.

Using the following code, partition data into 80% training and 20% testing:

set.seed(10) 
n <- nrow(election.cl)
idx.tr <- sample.int(n, 0.8*n) 
election.tr <- election.cl[idx.tr, ]
election.te <- election.cl[-idx.tr, ]

Use the following code to define 10 cross-validation folds:

set.seed(20) 
nfold <- 10
folds <- sample(cut(1:nrow(election.tr), breaks=nfold, labels=FALSE))

Using the following error rate function. And the object records is used to record the classification performance of each method in the subsequent problems.

calc_error_rate = function(predicted.value, true.value){
  return(mean(true.value!=predicted.value))
}
records = matrix(NA, nrow=3, ncol=2)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","logistic","lasso")

###Classification 15. Decision tree: train a decision tree by cv.tree(). Prune tree to minimize misclassification error. Be sure to use the folds from above for cross-validation. Visualize the trees before and after pruning. Save training and test errors to records object. Interpret and discuss the results of the decision tree analysis. Use this plot to tell a story about voting behavior.

We first train our decision tree and plot the unpruned tree.

set.seed(123)
#train decision tree
tree.election = tree(candidate ~., data=election.tr)

#plot tree before pruning
draw.tree(tree.election, nodeinfo=TRUE, cex = 0.3)
title("Decision Tree Before Pruning")

#test error on unpruned tree
tree.pred = predict(tree.election, election.te, type="class")
calc_error_rate(tree.pred, election.te$candidate)
## [1] 0.1003236

Before pruning, this tree has 14 branches. We know that this tree can be improved by pruning, so we determine the best size to prune the tree to with the cv.tree() function. We prune the tree to minimize misclassification error, that we get as 0.1003 with the unpruned tree.

#determine best size to prune tree  
cv = cv.tree(tree.election, FUN=prune.misclass, rand=folds)
cv$size
## [1] 14 11  8  5  3  1
cv$dev
## [1] 220 220 228 226 277 404
best.cv = min(cv$size[cv$dev == min(cv$dev)])
best.cv
## [1] 11

There are 2 trees that both have the minimum cross-validation estimate of test error rate from our K-fold cross validation, where “folds” is previously defined. We favor the tree of smaller size, so we pick size 11 as the best size to prune the tree.

#prune tree
prune.election = prune.misclass(tree.election,best=best.cv)

#plot tree after pruning
draw.tree(prune.election, nodeinfo=TRUE, cex=.3)
title("Pruned Decision Tree of Size 11")

#training error
tree.prob.train = predict(prune.election, election.tr, type="class")
tree.train.error = calc_error_rate(tree.prob.train, election.tr$candidate)
tree.train.error
## [1] 0.06520859
#test error
tree.prob.test = predict(prune.election, election.te, type="class")
tree.test.error = calc_error_rate(tree.prob.test, election.te$candidate)
tree.test.error
## [1] 0.1003236
#save decision tree training and test error in records
records[1,1] = tree.train.error
records[1,2] = tree.test.error
records
##          train.error test.error
## tree      0.06520859  0.1003236
## logistic          NA         NA
## lasso             NA         NA

This pruned tree now has 11 branches. We calculated our test error rate with the unpruned tree and yielded 0.1003236, and got the same test error rate for the pruned tree. So, we prefer the pruned tree since pruning yields a simpler tree without any cost in prediction error rate.

The pruned decision tree definitely tells a story about voting behavior. Each split is such that the cases with lower values go to the left and higher values to the right (<>). In the splits on “Minority” leading to branches (8) and (11), we can see that the counties with larger minority populations tended to vote for Joe Biden, so we can infer that Biden resonated more with non-white individuals. We can also see from all three splits on “Professional” that counties with more professional individuals resonated with Joe Biden, since every split has Biden listed on the right branch. Finally, we can also see that Biden resonated more in counties with more unemployed voters than Trump did.

16. Run a logistic regression to predict the winning candidate in each county. Save training and test errors to records variable. What are the significant variables? Are they consistent with what you saw in decision tree analysis? Interpret the meaning of a couple of the significant coefficients in terms of a unit change in the variables.

set.seed(123)
#logistic regression
log.election = glm(candidate ~., data = election.tr, family="binomial")

#training error
log.prob.train = predict(log.election, election.tr, type="response")
log.train.error = calc_error_rate(ifelse(log.prob.train <= 0.4, "Donald Trump", "Joe Biden"), election.tr$candidate)
log.train.error
## [1] 0.06399352
#test error
log.prob.test = predict(log.election, election.te, type="response")
log.test.error = calc_error_rate(ifelse(log.prob.test <= 0.4, "Donald Trump", "Joe Biden"), election.te$candidate)
log.test.error
## [1] 0.07443366
#save logistic regression training and test error in records
records[2,1] = log.train.error
records[2,2] = log.test.error
records
##          train.error test.error
## tree      0.06520859 0.10032362
## logistic  0.06399352 0.07443366
## lasso             NA         NA
summary(log.election)
## 
## Call:
## glm(formula = candidate ~ ., family = "binomial", data = election.tr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6260  -0.2306  -0.0944  -0.0316   3.3541  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -5.094e+01  6.818e+00  -7.471 7.93e-14 ***
## TotalPop          1.561e-06  6.064e-07   2.574 0.010049 *  
## Men               3.035e-02  4.577e-02   0.663 0.507205    
## VotingAgeCitizen  1.852e-01  2.614e-02   7.085 1.39e-12 ***
## Income           -1.545e-05  1.598e-05  -0.967 0.333535    
## Poverty           5.377e-02  4.217e-02   1.275 0.202337    
## ChildPoverty      1.737e-04  2.459e-02   0.007 0.994364    
## Professional      2.945e-01  3.836e-02   7.679 1.60e-14 ***
## Service           3.274e-01  4.522e-02   7.240 4.50e-13 ***
## Office            1.256e-01  4.787e-02   2.624 0.008679 ** 
## Production        1.553e-01  4.031e-02   3.852 0.000117 ***
## Drive            -1.192e-01  3.803e-02  -3.134 0.001722 ** 
## Carpool          -8.829e-02  4.986e-02  -1.771 0.076607 .  
## Transit           1.640e-01  8.798e-02   1.864 0.062388 .  
## OtherTransp       1.515e-01  9.291e-02   1.631 0.102977    
## WorkAtHome        4.633e-02  6.132e-02   0.756 0.449853    
## MeanCommute       2.393e-02  2.337e-02   1.024 0.305770    
## Employed          2.683e-01  3.278e-02   8.186 2.71e-16 ***
## PrivateWork       6.754e-02  2.064e-02   3.272 0.001067 ** 
## SelfEmployed     -3.618e-02  4.570e-02  -0.792 0.428554    
## FamilyWork       -5.582e-01  2.954e-01  -1.890 0.058782 .  
## Unemployment      2.143e-01  4.681e-02   4.579 4.68e-06 ***
## Minority          1.316e-01  9.578e-03  13.737  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2200.56  on 2468  degrees of freedom
## Residual deviance:  813.06  on 2446  degrees of freedom
## AIC: 859.06
## 
## Number of Fisher Scoring iterations: 7

From the p-values for each of our coefficients, we can see that the most significant variables in our logistic regression are “Service”, “Professional”, “Employed”, “Unemployment” “VotingAgeCitizen”, “Production”, aand “Minority”, in descending order of significance. This is fairly consistent with our decision tree analysis, as our pruned decision tree had splits at variables like “Minority”, “Service”, “Professional”, and “Unemployment”. All of these variables have positive coefficients, so are significant in that as these variables increase, the response variable increases as well. Due to the way we have coded our data, this means that as the percentage that these variables are represented in the county’s population increases, the likelihood that Joe Biden would be the favored candidate of this county increases as well. A variable like “FamilyWork” has large magnitude but is negative, so as the percentage that workers in “FamilyWork” were to increase in the county, the likelihood that Donald Trump would be the favored candidate of the county would increase.

17. You may notice that you get a warning glm.fit: fitted probabilities numerically 0 or 1 occurred. As we discussed in class, this is an indication that we have perfect separation (some linear combination of variables perfectly predicts the winner).This is usually a sign that we are overfitting. One way to control overfitting in logistic regression is through regularization.

Use the cv.glmnet function from the glmnet library to run a 10-fold cross validation and select the best regularization parameter for the logistic regression with LASSO penalty. Set lambda = seq(1, 50) * 1e-4 in cv.glmnet() function to set pre-defined candidate values for the tuning parameter \(\lambda\).

What is the optimal value of \(\lambda\) in cross validation? What are the non-zero coefficients in the LASSO regression for the optimal value of \(\lambda\) ? How do they compare to the unpenalized logistic regression? Comment on the comparison. Save training and test errors to the records variable.

set.seed(123)
#create lasso training and test data
lasso.train <- as.matrix(election.tr %>% select(-candidate))
lasso.test <- as.matrix(election.te %>% select(-candidate))

#lasso regression
lambda = seq(1, 50) * 1e-4
lasso.election = cv.glmnet(lasso.train, droplevels(election.tr$candidate), alpha = 1, lambda = lambda, family="binomial")

#best lambda value
bestlam = lasso.election$lambda.min
bestlam
## [1] 0.0012
#lasso coefficient estimates
out = glmnet(lasso.train, election.tr$candidate, alpha=1, family="binomial")
lasso_coef = predict(out, type='coefficients', s=bestlam)
lasso_coef
## 23 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)      -4.215486e+01
## TotalPop          1.581243e-06
## Men               .           
## VotingAgeCitizen  1.768727e-01
## Income            .           
## Poverty           5.955798e-02
## ChildPoverty      .           
## Professional      2.308346e-01
## Service           2.590757e-01
## Office            7.159235e-02
## Production        9.207544e-02
## Drive            -1.019722e-01
## Carpool          -7.135310e-02
## Transit           1.420269e-01
## OtherTransp       1.266704e-01
## WorkAtHome        2.701713e-02
## MeanCommute       7.072239e-03
## Employed          2.248092e-01
## PrivateWork       6.077757e-02
## SelfEmployed     -3.591103e-02
## FamilyWork       -4.613668e-01
## Unemployment      1.873725e-01
## Minority          1.193081e-01

We can see that for this lasso regression, the optimal value of \(\lambda\) is 0.0012, and that for this value of \(\lambda\) most of our variables have non-zero coefficients, with the exception of “Men”, “Income”, and “ChildPoverty”.

#training error
lasso.prob.train = predict(lasso.election, s=bestlam, newx = lasso.train)
lasso.train.error = calc_error_rate(ifelse(lasso.prob.train <= 0.4, "Donald Trump", "Joe Biden"), election.tr$candidate)
lasso.train.error
## [1] 0.0708789
#test error
lasso.prob.test = predict(lasso.election, s=bestlam, newx = lasso.test)
lasso.test.error = calc_error_rate(ifelse(lasso.prob.test <= 0.4, "Donald Trump", "Joe Biden"), election.te$candidate)
lasso.test.error
## [1] 0.0776699
#save lasso regression training and test error in records
records[3,1] = lasso.train.error
records[3,2] = lasso.test.error
records
##          train.error test.error
## tree      0.06520859 0.10032362
## logistic  0.06399352 0.07443366
## lasso     0.07087890 0.07766990

We can also see that the same variables are most significant between the logistic regression and lasso, with “Service”, “Professional”, and “Unemployed” as the top 3 most significant variables with the largest positive coefficients, and “FamilyWork” with the largest negative coefficient.

  1. Compute ROC curves for the decision tree, logistic regression and LASSO logistic regression using predictions on the test data. Display them on the same plot. Based on your classification results, discuss the pros and cons of the various methods. Are the different classifiers more appropriate for answering different kinds of questions about the election?
set.seed(123)
#pruned decision tree ROC
tree.prob.test = predict(prune.election, election.te)[,2]
tree.pred = prediction(tree.prob.test, election.te$candidate)
tree.perf = performance(tree.pred, measure="tpr", x.measure="fpr")
plot(tree.perf, col=2, lwd=3, main="Pruned Decision Tree ROC curve")
abline(0,1)

#pruned decision tree AUC
tree.auc = performance(tree.pred, "auc")@y.values
tree.auc
## [[1]]
## [1] 0.8835756
#logistic regression ROC
log.prob.test = predict(log.election, election.te, type="response")
log.pred = prediction(log.prob.test, election.te$candidate)
log.perf = performance(log.pred, measure="tpr", x.measure="fpr")
plot(log.perf, col="blue", lwd=3, main="Logistic Regression ROC curve")
abline(0,1)

#pruned decision tree AUC
log.auc = performance(log.pred, "auc")@y.values
log.auc
## [[1]]
## [1] 0.9629435
#lasso regression ROC
lasso.prob.test = predict(lasso.election, s=bestlam, newx = lasso.test)
lasso.pred = prediction(lasso.prob.test, election.te$candidate)
lasso.perf = performance(lasso.pred, measure="tpr", x.measure="fpr")
plot(lasso.perf, col="green", lwd=3, main="LASSO ROC curve")
abline(0,1)

#pruned decision tree AUC
lasso.auc = performance(lasso.pred, "auc")@y.values
lasso.auc
## [[1]]
## [1] 0.9614254

From observing the ROC curves, it is clear that of the three, the pruned decision tree has the worst ROC curve since the area under the curve is cleary the smallest. We confirm this by solving for the AUC and seeing that it is indeed of the lowest magnitude. It appears that the Logistic Regression ROC curve has the best ROC curve with the largest AUC, since it is a smoother curve than the lasso ROC. Calculating the AUC’s of these two confirms that the logistic regression ROC has the largest area under the curve, which directy means that it performs the best of the three methods. We can further confirm this hypothesis by observing the values stored in the “records” table. We also see that the logistic regression has the smallest training and testing error of the three methods, in the case where the lasso and logistic regression error rates were determined using a probability threshold of 0.4. While the decision tree has a smaller training error than the lasso, its test error is significantly larger than lasso’s, making it overall a worse model.

Each of these methods have their own pros and cons associated with them, and handle data differently. Decision trees repeatedly divide a space into smaller and smaller portions, while logistic regression fits a single line to divide the space into 2 portions. Decision trees are the better method to use when there is a clear non-linear boundary between two classes in a dataset. However, logistic regression is favorable in the case where that boundary is not very clear, like in our election classifier. Decision Trees paint a clearer picture of the overall story of how the data can be interpreted, but logistic regression can be argued as easier to interpret since one just has to look at the magnitude and signs of the coefficients. However, logistic regression and decision trees have the tendency to overfit data, which is where lasso has the advantage as a regularization method to reduce overfitting. It accomplishes this by only selecting a subset of predictor variables to use.

Each of these different classifiers could help us answer different questions about the election. The logistic regression thrives in answering questions such as the one we worked on in this project, for predicting the election winner per county. The LASSO thrived with this question as well. However, the decision tree would be useful in predicting something like percentage of Unemployment in a county if given all other census columns to use. Since there is an interesting relationship between unemployment and some of the other predictor variables, one that is not collinear but is probably related to Poverty or Employment, there would be an interesting non-linear boundary present that could be well-parsed with a decision tree.

###Taking it further

19. Explore additional classification methods. Consider applying additional two classification methods from KNN, LDA, QDA, SVM, random forest, boosting, neural networks etc. (You may research and use methods beyond those covered in this course). How do these compare to the tree method, logistic regression, and the lasso logistic regression?

We decided to additionally run a boosting model first.

set.seed(123)
#boosting
boost.election = gbm(ifelse(candidate == "Donald Trump",0,1) ~ ., data = election.tr, distribution = "bernoulli", n.trees=1000, interaction.depth=2)
summary(boost.election)
#boosting training error
yhat.boost.train = predict(boost.election, newdata = election.tr, n.trees=1000, type = "response")
boost.train.error = calc_error_rate(ifelse(yhat.boost.train <= 0.4, "Donald Trump", "Joe Biden"), election.tr$candidate)
boost.train.error
## [1] 0.001215067
#boosting test error
yhat.boost.test = predict(boost.election, newdata = election.te, n.trees=1000, type = "response")
boost.test.error = calc_error_rate(ifelse(yhat.boost.test <= 0.4, "Donald Trump", "Joe Biden"), election.te$candidate)
boost.test.error
## [1] 0.07281553
#add boosting to records
records = rbind(records, boosting = c(boost.train.error, boost.test.error))

#ROC curve
pred.boost = prediction(yhat.boost.test, election.te$candidate)
perf.boost = performance(pred.boost, measure="tpr", x.measure="fpr")
plot(perf.boost, col="purple", lwd=3, main="Boosting ROC Curve")
abline(0,1)

#boosting AUC
boost.auc = performance(pred.boost, "auc")@y.values
boost.auc
## [[1]]
## [1] 0.9630019

This model supports the idea that “Minority” is the most important variable of the dataset. This directly supports the pruned decision tree model output, which repeatedly used “Minority” in splits of the tree. This model also has the smallest training and testing errors of the three previous models, as well as the largest AUC.

We additionally decide to run a kNN classifier on our data, with k=10 nearest neighbors.

set.seed(123)
#kNN
knn.train = election.tr %>% select(-candidate) %>% scale(center=TRUE, scale=TRUE)
knn.test = election.te %>% select(-candidate) %>% scale(center=TRUE, scale=TRUE)

#kNN training error
pred.knn.train = knn(train=knn.train, test=knn.train, cl=election.tr$candidate, k=10)
knn.train.error = calc_error_rate(pred.knn.train, election.tr$candidate)
knn.train.error
## [1] 0.07330903
#kNN training error
pred.knn.test = knn(train=knn.train, test=knn.test, cl=election.tr$candidate, k=10)
knn.test.error = calc_error_rate(pred.knn.test, election.te$candidate)
knn.test.error
## [1] 0.08414239
#add knn to records
records = rbind(records, kNN = c(knn.train.error, knn.test.error))

#ROC curve
prob.knn.test = knn(train=knn.train, test=knn.test, cl=election.tr$candidate, k=10, prob=TRUE)
prob = attr(prob.knn.test, "prob")
prob = 1-prob
pred.knn = prediction(prob, election.te$candidate)
perf.knn = performance(pred.knn, measure="tpr", x.measure="fpr")
plot(perf.knn, col="pink", lwd=3, main="kNN ROC Curve")
abline(0,1)

#kNN AUC
knn.auc = performance(pred.knn, "auc")@y.values
knn.auc
## [[1]]
## [1] 0.8401257
records
##          train.error test.error
## tree     0.065208586 0.10032362
## logistic 0.063993520 0.07443366
## lasso    0.070878898 0.07766990
## boosting 0.001215067 0.07281553
## kNN      0.073309032 0.08414239

This yielded our highest training error of the 5 models, and the second largest test error. It also has a very poor ROC curve, with the smallest AUC of the 5 models. Given what we can see from the boosting and kNN models, we can rank all 5 of our models with boosting as the best, following by logistic regression, lasso regression, decision tree, and finally kNN as the worst.

20. Tackle at least one more interesting question. Creative and thoughtful analysis will be rewarded! Some possibilities for further exploration are:

Consider a regression problem! Use linear regression models to predict the total vote for each candidate by county. Compare and contrast these results with the classification models. Which do you prefer and why? How might they complement one another?

We decided to construct linear regression models to predict the total vote for each candidate by county. This requires pulling data directly from election.raw, since this dataset holds the number of votes that each candidate received in each county. From here, we made the “state” and “county” columns of election.raw all lowercase and filtered election.raw to only hold information pertaining to only Joe Biden and Donald Trump, since we decided to focus on linear regressions for just these 2 candidates. We also made the “State” and “County” columns of census.clean lowercase as well. From here, we did 2 left joins between tmpwinner, i.e. our mutated election.raw, and census.clean, with the dataset election.reg.biden containing data pertaining to Joe Biden, and election.reg.trump with data for Donald Trump. We dropped the levels on the candidates, and refined our 2 regression datasets by removing the county, party, CountyId, state, and candidate columns, since our response is “votes” and predictors are all of the census columns relating to population.

set.seed(123)
#linear regression models to predict total vote for each candidate by county
# we move all state and county names into lower-case
tmpwinner <- election.raw %>%
  mutate_at(vars(state, county), tolower) %>%
  subset(candidate == "Joe Biden")

# we move all state and county names into lower-case
# we further remove suffixes of "county" and "parish"
tmpcensus <- census.clean %>% mutate_at(vars(State, County), tolower) %>%
  mutate(County = gsub(" county|  parish", "", County)) 

# we join the two datasets
election.reg.biden <- tmpwinner %>%
  left_join(tmpcensus, by = c("state"="State", "county"="County")) %>% 
  na.omit

# drop levels of candidate
election.reg.biden$candidate <- droplevels(election.reg.biden$candidate)

## save meta information
election.meta.biden <- election.reg.biden %>% select(c(county, party, CountyId, state, candidate))

## save predictors and class labels
election.reg.biden = election.reg.biden %>% select(-c(county, party, CountyId, state, candidate))

We then proceeded to construct a linear regression model to predict the number of votes Joe Biden gets per county, with votes as the response and all cof the other columns as the predictors.

set.seed(123)
#obtain training dataset
n <- nrow(election.reg.biden)
idx.tr <- sample.int(n, 0.8*n) 
votes.biden.tr <- election.reg.biden[idx.tr, ]

#linear regression on total dataset
lin.votes.bidentot = lm(votes ~., data = election.reg.biden)
summary(lin.votes.bidentot)
## 
## Call:
## lm(formula = votes ~ ., data = election.reg.biden)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -281897   -3208     275    3638  357772 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.115e+05  2.623e+04  -4.251 2.19e-05 ***
## TotalPop          2.526e-01  1.375e-03 183.719  < 2e-16 ***
## Men               2.233e+02  1.896e+02   1.178  0.23886    
## VotingAgeCitizen  6.538e+02  1.098e+02   5.953 2.94e-09 ***
## Income           -4.747e-02  7.032e-02  -0.675  0.49969    
## Poverty           9.562e+01  1.893e+02   0.505  0.61360    
## ChildPoverty      5.495e+01  1.047e+02   0.525  0.59989    
## Professional      5.543e+02  1.371e+02   4.043 5.40e-05 ***
## Service          -2.417e+02  1.641e+02  -1.473  0.14091    
## Office           -5.642e+02  1.727e+02  -3.267  0.00110 ** 
## Production        2.493e+01  1.385e+02   0.180  0.85711    
## Drive             1.585e+02  1.780e+02   0.891  0.37323    
## Carpool           4.669e+02  2.282e+02   2.047  0.04078 *  
## Transit          -3.513e+02  2.597e+02  -1.353  0.17631    
## OtherTransp       1.434e+03  4.038e+02   3.551  0.00039 ***
## WorkAtHome        6.205e+02  2.737e+02   2.267  0.02347 *  
## MeanCommute      -1.743e+01  9.248e+01  -0.188  0.85054    
## Employed          3.065e+02  1.184e+02   2.588  0.00971 ** 
## PrivateWork       1.490e+02  8.844e+01   1.684  0.09223 .  
## SelfEmployed     -2.435e+02  1.717e+02  -1.419  0.15613    
## FamilyWork       -2.313e+02  9.392e+02  -0.246  0.80546    
## Unemployment     -8.340e+01  1.959e+02  -0.426  0.67033    
## Minority          1.467e+02  3.066e+01   4.784 1.80e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20890 on 3066 degrees of freedom
## Multiple R-squared:  0.9415, Adjusted R-squared:  0.9411 
## F-statistic:  2244 on 22 and 3066 DF,  p-value: < 2.2e-16

We note a large \(R^2\) value, a large F-statistic, and a small associated aggregate p-value. We also can see that there are many variables with very large individual p values, so know that our dataset needs to be cleaned up to run the most efficient linear regression, so we need to decide on which variables are important to keep. Before this, we note that there is a very large residual standard error, meaning that the data does not fit very well to our model. We must first check and see why this is the case.

set.seed(123)
#Residuals vs Fit
residuals.biden = residuals(lin.votes.bidentot)
fit.biden = fitted(lin.votes.bidentot)
plot(fit.biden, residuals.biden, xlab = "Fitted Values", ylab = "Residuals", main = "Residual vs Fit")

#Q-Q Plot
qqnorm(residuals.biden) #skewed
qqline(residuals.biden, col="red")

We can see from the Residuals vs Fit plot that the data is very clustered in one area, and then fans out. This hints to the data not having a constant error variance. We can also see from the Q-Q plot that the residuals are very heavy-tailed, and not normally distributed since they don’t really follow the red normal line. Since we have problems of non-normality and unequal error variance, we can improve our linear regression by transforming the Y values with the help of a box-cox transformation. We cannot have y values of 0 in the boxCox function, so we find the optimal transformation value using a subset of votes.biden.tr excluding rows where “votes” = 0.

set.seed(123)
#boxCox
boxbidentr = votes.biden.tr %>% subset(votes != 0)

boxcox <- boxCox(boxbidentr$votes ~ boxbidentr$TotalPop + boxbidentr$Men + boxbidentr$Income + boxbidentr$Poverty + boxbidentr$ChildPoverty + boxbidentr$Professional + boxbidentr$Service + boxbidentr$Drive + boxbidentr$Carpool + boxbidentr$Transit + boxbidentr$OtherTransp + boxbidentr$VotingAgeCitizen + boxbidentr$WorkAtHome + boxbidentr$MeanCommute + boxbidentr$Employed + boxbidentr$SelfEmployed + boxbidentr$FamilyWork + boxbidentr$Unemployment + boxbidentr$Minority + boxbidentr$Production + boxbidentr$Office + boxbidentr$PrivateWork)

lambda <- boxcox$x 
log_likelihood <- boxcox$y
boxcox_data <- cbind(lambda, log_likelihood)
sorted_boxcox_data <- boxcox_data[order(-log_likelihood),] 
head(sorted_boxcox_data, n=10)
##            lambda log_likelihood
##  [1,]  0.06060606      -29738.34
##  [2,]  0.10101010      -29744.49
##  [3,]  0.02020202      -29753.33
##  [4,]  0.14141414      -29770.80
##  [5,] -0.02020202      -29790.43
##  [6,]  0.18181818      -29816.31
##  [7,] -0.06060606      -29850.59
##  [8,]  0.22222222      -29880.06
##  [9,] -0.10101010      -29934.76
## [10,]  0.26262626      -29961.10

We can see that our boxCox suggests transforming votes by the power of 0.06061, since that is the value of lambda associated with the largest log-likelihood. We can transform our linear regression and see if there is any improvement

set.seed(123)
#transformed linear regression
trans.lin.votes.biden = lm(votes^0.06061 ~ ., data = votes.biden.tr)

#updated graphs
#Residuals vs Fit
residuals.biden = residuals(trans.lin.votes.biden)
fit.biden = fitted(trans.lin.votes.biden)
plot(fit.biden, residuals.biden, xlab = "Fitted Values", ylab = "Residuals", main = "Residual vs Fit")

#Q-Q Plot
qqnorm(residuals.biden) #skewed
qqline(residuals.biden, col="red")

summary(trans.lin.votes.biden)
## 
## Call:
## lm(formula = votes^0.06061 ~ ., data = votes.biden.tr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.55615 -0.04678  0.00598  0.05584  0.29556 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -7.377e-01  1.308e-01  -5.638 1.92e-08 ***
## TotalPop          1.887e-07  8.361e-09  22.571  < 2e-16 ***
## Men              -2.234e-03  9.385e-04  -2.380 0.017384 *  
## VotingAgeCitizen  1.009e-04  5.433e-04   0.186 0.852744    
## Income            1.891e-06  3.478e-07   5.437 5.95e-08 ***
## Poverty           1.507e-03  9.388e-04   1.605 0.108549    
## ChildPoverty     -3.594e-04  5.152e-04  -0.698 0.485517    
## Professional      1.326e-02  6.846e-04  19.362  < 2e-16 ***
## Service           1.554e-02  8.136e-04  19.098  < 2e-16 ***
## Office            1.505e-02  8.483e-04  17.745  < 2e-16 ***
## Production        6.932e-03  6.782e-04  10.222  < 2e-16 ***
## Drive             5.547e-03  9.006e-04   6.159 8.53e-10 ***
## Carpool           7.033e-03  1.157e-03   6.079 1.40e-09 ***
## Transit           4.641e-03  1.290e-03   3.598 0.000328 ***
## OtherTransp       1.127e-02  2.023e-03   5.573 2.79e-08 ***
## WorkAtHome        9.277e-03  1.422e-03   6.526 8.20e-11 ***
## MeanCommute       9.624e-04  4.583e-04   2.100 0.035843 *  
## Employed          1.953e-03  5.832e-04   3.349 0.000824 ***
## PrivateWork       7.207e-03  4.431e-04  16.265  < 2e-16 ***
## SelfEmployed     -7.511e-03  8.654e-04  -8.679  < 2e-16 ***
## FamilyWork        7.656e-03  4.961e-03   1.543 0.122913    
## Unemployment      5.560e-03  9.719e-04   5.720 1.19e-08 ***
## Minority          1.348e-03  1.529e-04   8.816  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09278 on 2448 degrees of freedom
## Multiple R-squared:  0.7522, Adjusted R-squared:   0.75 
## F-statistic: 337.8 on 22 and 2448 DF,  p-value: < 2.2e-16

We can see a great improvement in our Residuals vs Fit and QQ plots. There is no more fanning in the Residuals vs Fit plot anymore, and there are very few outliers present. The tails have also been balanced in the Q-Q plot, and our data is now normally distributed. There is also a great decrease in RSE, supporting the fact that our data is better fit now, even though the \(R^2\) values have both decreased.

We can do this with a stepwise regression with the step() function, which selects a formula-based model for linear regression by AIC. We do this to try to simplify our model and decrease our predictors. We insert mod0, a linear model consisting of only the response “votes”, and mod1, a linear model consisting of ALL predictors against the response “votes”.

set.seed(123)
#stepwise regression biden - decide on important variables
mod0 <- lm(votes^0.06061 ~ 1, data = votes.biden.tr)
mod1 <- lm(votes^0.06061 ~ ., data = votes.biden.tr)
step(mod0, scope = list(lower=mod0, upper = mod1))
## Start:  AIC=-8323.15
## votes^0.06061 ~ 1
## 
##                    Df Sum of Sq    RSS     AIC
## + TotalPop          1   31.5969 53.453 -9468.8
## + SelfEmployed      1   22.5835 62.466 -9083.7
## + PrivateWork       1   19.2070 65.843 -8953.6
## + Office            1   16.6445 68.405 -8859.3
## + Income            1   12.2722 72.778 -8706.2
## + Transit           1    9.6206 75.429 -8617.8
## + Professional      1    7.6532 77.397 -8554.1
## + Minority          1    6.2856 78.764 -8510.9
## + FamilyWork        1    5.1790 79.871 -8476.4
## + Men               1    4.9812 80.069 -8470.3
## + MeanCommute       1    4.7901 80.260 -8464.4
## + VotingAgeCitizen  1    3.7009 81.349 -8431.1
## + WorkAtHome        1    3.5815 81.468 -8427.5
## + Production        1    3.5807 81.469 -8427.4
## + Employed          1    3.5754 81.474 -8427.3
## + Unemployment      1    1.4921 83.558 -8364.9
## + Carpool           1    1.1771 83.873 -8355.6
## + Service           1    0.7771 84.273 -8343.8
## + OtherTransp       1    0.6747 84.375 -8340.8
## + Poverty           1    0.5039 84.546 -8335.8
## + Drive             1    0.4017 84.648 -8332.8
## + ChildPoverty      1    0.3776 84.672 -8332.1
## <none>                          85.050 -8323.1
## 
## Step:  AIC=-9468.76
## votes^0.06061 ~ TotalPop
## 
##                    Df Sum of Sq    RSS      AIC
## + SelfEmployed      1   14.8933 38.560 -10273.8
## + PrivateWork       1   10.2261 43.227  -9991.5
## + Office            1    8.6251 44.828  -9901.6
## + Income            1    4.0248 49.428  -9660.2
## + WorkAtHome        1    3.1642 50.289  -9617.5
## + FamilyWork        1    3.0796 50.373  -9613.4
## + Drive             1    2.6121 50.841  -9590.6
## + Men               1    2.3593 51.094  -9578.3
## + Professional      1    1.6049 51.848  -9542.1
## + MeanCommute       1    1.3730 52.080  -9531.1
## + Minority          1    1.1273 52.326  -9519.4
## + Unemployment      1    1.0678 52.385  -9516.6
## + Employed          1    0.9546 52.498  -9511.3
## + Service           1    0.7091 52.744  -9499.8
## + Production        1    0.4816 52.971  -9489.1
## + Carpool           1    0.4636 52.989  -9488.3
## + Transit           1    0.3781 53.075  -9484.3
## + OtherTransp       1    0.1189 53.334  -9472.3
## + VotingAgeCitizen  1    0.1101 53.343  -9471.9
## + Poverty           1    0.0893 53.364  -9470.9
## + ChildPoverty      1    0.0630 53.390  -9469.7
## <none>                          53.453  -9468.8
## - TotalPop          1   31.5969 85.050  -8323.1
## 
## Step:  AIC=-10273.78
## votes^0.06061 ~ TotalPop + SelfEmployed
## 
##                    Df Sum of Sq    RSS      AIC
## + Professional      1    4.9400 33.620 -10610.5
## + Income            1    4.1571 34.403 -10553.7
## + Office            1    3.2793 35.280 -10491.4
## + Employed          1    2.7763 35.783 -10456.4
## + Production        1    2.4836 36.076 -10436.3
## + PrivateWork       1    1.5825 36.977 -10375.3
## + Men               1    1.4616 37.098 -10367.3
## + WorkAtHome        1    1.3467 37.213 -10359.6
## + Poverty           1    1.0583 37.501 -10340.5
## + ChildPoverty      1    1.0397 37.520 -10339.3
## + Carpool           1    0.5524 38.007 -10307.4
## + OtherTransp       1    0.2749 38.285 -10289.5
## + Transit           1    0.2701 38.290 -10289.1
## + FamilyWork        1    0.1571 38.403 -10281.9
## + Minority          1    0.0932 38.467 -10277.8
## + MeanCommute       1    0.0712 38.489 -10276.3
## + Drive             1    0.0379 38.522 -10274.2
## + Service           1    0.0369 38.523 -10274.1
## <none>                          38.560 -10273.8
## + Unemployment      1    0.0086 38.551 -10272.3
## + VotingAgeCitizen  1    0.0018 38.558 -10271.9
## - SelfEmployed      1   14.8933 53.453  -9468.8
## - TotalPop          1   23.9067 62.466  -9083.7
## 
## Step:  AIC=-10610.54
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional
## 
##                    Df Sum of Sq    RSS      AIC
## + Office            1    3.2246 30.395 -10857.7
## + PrivateWork       1    2.4808 31.139 -10797.9
## + Men               1    0.8779 32.742 -10673.9
## + Income            1    0.8083 32.811 -10668.7
## + Service           1    0.6875 32.932 -10659.6
## + Employed          1    0.5201 33.100 -10647.1
## + Minority          1    0.3701 33.250 -10635.9
## + Unemployment      1    0.3028 33.317 -10630.9
## + MeanCommute       1    0.2357 33.384 -10625.9
## + FamilyWork        1    0.1829 33.437 -10622.0
## + OtherTransp       1    0.1675 33.452 -10620.9
## + WorkAtHome        1    0.1083 33.512 -10616.5
## + Drive             1    0.0771 33.543 -10614.2
## + Production        1    0.0653 33.555 -10613.3
## + Poverty           1    0.0643 33.555 -10613.3
## + VotingAgeCitizen  1    0.0281 33.592 -10610.6
## <none>                          33.620 -10610.5
## + Carpool           1    0.0153 33.605 -10609.7
## + Transit           1    0.0111 33.609 -10609.4
## + ChildPoverty      1    0.0075 33.612 -10609.1
## - Professional      1    4.9400 38.560 -10273.8
## - TotalPop          1   15.5595 49.179  -9672.7
## - SelfEmployed      1   18.2284 51.848  -9542.1
## 
## Step:  AIC=-10857.69
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office
## 
##                    Df Sum of Sq    RSS      AIC
## + PrivateWork       1    1.8359 28.559 -11009.6
## + Service           1    0.9572 29.438 -10934.8
## + Income            1    0.5993 29.796 -10904.9
## + Minority          1    0.5224 29.873 -10898.5
## + Employed          1    0.4826 29.913 -10895.2
## + Men               1    0.4435 29.952 -10892.0
## + Unemployment      1    0.2767 30.118 -10878.3
## + OtherTransp       1    0.1817 30.213 -10870.5
## + Production        1    0.1501 30.245 -10867.9
## + VotingAgeCitizen  1    0.1484 30.247 -10867.8
## + WorkAtHome        1    0.1416 30.254 -10867.2
## + MeanCommute       1    0.1356 30.260 -10866.7
## + FamilyWork        1    0.1172 30.278 -10865.2
## + Transit           1    0.0954 30.300 -10863.5
## <none>                          30.395 -10857.7
## + Poverty           1    0.0119 30.383 -10856.7
## + Drive             1    0.0096 30.386 -10856.5
## + Carpool           1    0.0073 30.388 -10856.3
## + ChildPoverty      1    0.0002 30.395 -10855.7
## - Office            1    3.2246 33.620 -10610.5
## - Professional      1    4.8853 35.280 -10491.4
## - SelfEmployed      1   12.2816 42.677 -10021.1
## - TotalPop          1   13.0022 43.397  -9979.7
## 
## Step:  AIC=-11009.64
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork
## 
##                    Df Sum of Sq    RSS    AIC
## + Service           1    2.7449 25.814 -11257
## + Minority          1    2.0851 26.474 -11195
## + Unemployment      1    1.3889 27.170 -11131
## + ChildPoverty      1    0.3211 28.238 -11036
## + Poverty           1    0.3190 28.240 -11035
## + OtherTransp       1    0.3189 28.240 -11035
## + WorkAtHome        1    0.1726 28.387 -11023
## + VotingAgeCitizen  1    0.1677 28.392 -11022
## + Men               1    0.1510 28.408 -11021
## + Drive             1    0.0960 28.463 -11016
## + Transit           1    0.0922 28.467 -11016
## + Income            1    0.0914 28.468 -11016
## + MeanCommute       1    0.0872 28.472 -11015
## + Carpool           1    0.0577 28.502 -11013
## + Production        1    0.0541 28.505 -11012
## + FamilyWork        1    0.0360 28.523 -11011
## <none>                          28.559 -11010
## + Employed          1    0.0007 28.559 -11008
## - PrivateWork       1    1.8359 30.395 -10858
## - Office            1    2.5797 31.139 -10798
## - Professional      1    5.6462 34.205 -10566
## - SelfEmployed      1    5.7115 34.271 -10561
## - TotalPop          1   10.9983 39.558 -10207
## 
## Step:  AIC=-11257.33
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service
## 
##                    Df Sum of Sq    RSS    AIC
## + Minority          1    1.7068 24.108 -11424
## + Unemployment      1    1.0047 24.810 -11353
## + Production        1    0.7793 25.035 -11331
## + VotingAgeCitizen  1    0.5392 25.275 -11308
## + MeanCommute       1    0.2685 25.546 -11281
## + Income            1    0.2562 25.558 -11280
## + Men               1    0.2353 25.579 -11278
## + ChildPoverty      1    0.1686 25.646 -11272
## + Poverty           1    0.1554 25.659 -11270
## + OtherTransp       1    0.1229 25.692 -11267
## + Carpool           1    0.0964 25.718 -11265
## + WorkAtHome        1    0.0683 25.746 -11262
## <none>                          25.814 -11257
## + FamilyWork        1    0.0154 25.799 -11257
## + Transit           1    0.0103 25.804 -11256
## + Employed          1    0.0054 25.809 -11256
## + Drive             1    0.0004 25.814 -11255
## - Service           1    2.7449 28.559 -11010
## - Office            1    2.7508 28.565 -11009
## - SelfEmployed      1    3.0259 28.840 -10985
## - PrivateWork       1    3.6236 29.438 -10935
## - Professional      1    7.8461 33.660 -10604
## - TotalPop          1    9.0648 34.879 -10516
## 
## Step:  AIC=-11424.36
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service + Minority
## 
##                    Df Sum of Sq    RSS    AIC
## + Production        1    1.0135 23.094 -11528
## + Unemployment      1    0.4067 23.701 -11464
## + Income            1    0.3148 23.793 -11455
## + MeanCommute       1    0.2957 23.812 -11453
## + Men               1    0.1799 23.928 -11441
## + WorkAtHome        1    0.1138 23.994 -11434
## + Employed          1    0.0818 24.026 -11431
## + OtherTransp       1    0.0628 24.045 -11429
## + Carpool           1    0.0239 24.084 -11425
## <none>                          24.108 -11424
## + Drive             1    0.0091 24.098 -11423
## + Poverty           1    0.0061 24.101 -11423
## + ChildPoverty      1    0.0058 24.102 -11423
## + Transit           1    0.0026 24.105 -11423
## + FamilyWork        1    0.0010 24.107 -11422
## + VotingAgeCitizen  1    0.0001 24.107 -11422
## - SelfEmployed      1    1.4342 25.542 -11284
## - Minority          1    1.7068 25.814 -11257
## - Service           1    2.3666 26.474 -11195
## - Office            1    2.7736 26.881 -11157
## - PrivateWork       1    5.0798 29.187 -10954
## - TotalPop          1    5.6099 29.718 -10909
## - Professional      1    8.8905 32.998 -10651
## 
## Step:  AIC=-11528.49
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service + Minority + Production
## 
##                    Df Sum of Sq    RSS    AIC
## + Income            1    0.5557 22.538 -11587
## + MeanCommute       1    0.4062 22.688 -11570
## + Unemployment      1    0.2392 22.855 -11552
## + WorkAtHome        1    0.1813 22.913 -11546
## + Employed          1    0.0958 22.998 -11537
## + Men               1    0.0907 23.003 -11536
## + Carpool           1    0.0803 23.014 -11535
## + Poverty           1    0.0773 23.017 -11535
## + OtherTransp       1    0.0661 23.028 -11534
## + ChildPoverty      1    0.0650 23.029 -11534
## + VotingAgeCitizen  1    0.0525 23.041 -11532
## <none>                          23.094 -11528
## + Transit           1    0.0028 23.091 -11527
## + Drive             1    0.0009 23.093 -11527
## + FamilyWork        1    0.0006 23.093 -11527
## - SelfEmployed      1    0.9490 24.043 -11431
## - Production        1    1.0135 24.108 -11424
## - Minority          1    1.9410 25.035 -11331
## - Service           1    3.3772 26.471 -11193
## - Office            1    3.7871 26.881 -11155
## - PrivateWork       1    3.9635 27.058 -11139
## - TotalPop          1    5.4916 28.586 -11003
## - Professional      1    7.0957 30.190 -10868
## 
## Step:  AIC=-11586.68
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service + Minority + Production + Income
## 
##                    Df Sum of Sq    RSS    AIC
## + Unemployment      1    0.5608 21.978 -11647
## + MeanCommute       1    0.2565 22.282 -11613
## + Men               1    0.2165 22.322 -11608
## + Poverty           1    0.1323 22.406 -11599
## + WorkAtHome        1    0.1183 22.420 -11598
## + ChildPoverty      1    0.0759 22.462 -11593
## + OtherTransp       1    0.0703 22.468 -11592
## + Carpool           1    0.0660 22.472 -11592
## <none>                          22.538 -11587
## + Transit           1    0.0144 22.524 -11586
## + FamilyWork        1    0.0039 22.535 -11585
## + Employed          1    0.0019 22.536 -11585
## + Drive             1    0.0009 22.538 -11585
## + VotingAgeCitizen  1    0.0000 22.538 -11585
## - Income            1    0.5557 23.094 -11528
## - SelfEmployed      1    0.9157 23.454 -11490
## - Production        1    1.2544 23.793 -11455
## - Minority          1    2.0568 24.595 -11373
## - PrivateWork       1    2.8308 25.369 -11296
## - Service           1    3.7928 26.331 -11204
## - Office            1    3.9766 26.515 -11187
## - TotalPop          1    5.2164 27.755 -11074
## - Professional      1    5.2544 27.793 -11071
## 
## Step:  AIC=-11646.94
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service + Minority + Production + Income + 
##     Unemployment
## 
##                    Df Sum of Sq    RSS    AIC
## + Men               1    0.1624 21.815 -11663
## + WorkAtHome        1    0.0920 21.886 -11655
## + Employed          1    0.0890 21.889 -11655
## + MeanCommute       1    0.0777 21.900 -11654
## + OtherTransp       1    0.0596 21.918 -11652
## + Carpool           1    0.0499 21.928 -11651
## + Transit           1    0.0230 21.955 -11648
## <none>                          21.978 -11647
## + Poverty           1    0.0095 21.968 -11646
## + FamilyWork        1    0.0082 21.969 -11646
## + VotingAgeCitizen  1    0.0047 21.973 -11646
## + ChildPoverty      1    0.0030 21.974 -11645
## + Drive             1    0.0012 21.976 -11645
## - SelfEmployed      1    0.5558 22.533 -11587
## - Unemployment      1    0.5608 22.538 -11587
## - Income            1    0.8772 22.855 -11552
## - Production        1    1.0710 23.049 -11531
## - Minority          1    1.2693 23.247 -11510
## - PrivateWork       1    3.1919 25.169 -11314
## - Service           1    3.4531 25.431 -11288
## - Office            1    3.5201 25.498 -11282
## - TotalPop          1    5.0587 27.036 -11137
## - Professional      1    5.0928 27.070 -11134
## 
## Step:  AIC=-11663.27
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service + Minority + Production + Income + 
##     Unemployment + Men
## 
##                    Df Sum of Sq    RSS    AIC
## + WorkAtHome        1    0.1087 21.706 -11674
## + MeanCommute       1    0.0691 21.746 -11669
## + OtherTransp       1    0.0578 21.757 -11668
## + Carpool           1    0.0437 21.771 -11666
## + Employed          1    0.0293 21.786 -11665
## + Transit           1    0.0264 21.789 -11664
## <none>                          21.815 -11663
## + FamilyWork        1    0.0077 21.808 -11662
## + Poverty           1    0.0058 21.809 -11662
## + ChildPoverty      1    0.0013 21.814 -11661
## + Drive             1    0.0008 21.814 -11661
## + VotingAgeCitizen  1    0.0004 21.815 -11661
## - Men               1    0.1624 21.978 -11647
## - Unemployment      1    0.5067 22.322 -11608
## - SelfEmployed      1    0.6036 22.419 -11598
## - Income            1    0.9848 22.800 -11556
## - Production        1    0.9982 22.813 -11555
## - Minority          1    1.2519 23.067 -11527
## - PrivateWork       1    2.7439 24.559 -11372
## - Office            1    3.2163 25.031 -11325
## - Service           1    3.4940 25.309 -11298
## - Professional      1    4.4057 26.221 -11211
## - TotalPop          1    5.0508 26.866 -11151
## 
## Step:  AIC=-11673.62
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service + Minority + Production + Income + 
##     Unemployment + Men + WorkAtHome
## 
##                    Df Sum of Sq    RSS    AIC
## + MeanCommute       1    0.0658 21.641 -11679
## + Carpool           1    0.0560 21.650 -11678
## + OtherTransp       1    0.0450 21.661 -11677
## + Drive             1    0.0349 21.672 -11676
## + Transit           1    0.0238 21.683 -11674
## + Employed          1    0.0208 21.686 -11674
## <none>                          21.706 -11674
## + Poverty           1    0.0055 21.701 -11672
## + ChildPoverty      1    0.0015 21.705 -11672
## + FamilyWork        1    0.0010 21.705 -11672
## + VotingAgeCitizen  1    0.0008 21.706 -11672
## - WorkAtHome        1    0.1087 21.815 -11663
## - Men               1    0.1792 21.886 -11655
## - Unemployment      1    0.4770 22.183 -11622
## - SelfEmployed      1    0.6946 22.401 -11598
## - Income            1    0.9016 22.608 -11575
## - Production        1    1.0380 22.744 -11560
## - Minority          1    1.2994 23.006 -11532
## - PrivateWork       1    2.7466 24.453 -11381
## - Office            1    3.2678 24.974 -11329
## - Service           1    3.4305 25.137 -11313
## - Professional      1    4.0982 25.805 -11248
## - TotalPop          1    5.0377 26.744 -11160
## 
## Step:  AIC=-11679.12
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service + Minority + Production + Income + 
##     Unemployment + Men + WorkAtHome + MeanCommute
## 
##                    Df Sum of Sq    RSS    AIC
## + Employed          1    0.0581 21.582 -11684
## + Carpool           1    0.0521 21.589 -11683
## + OtherTransp       1    0.0502 21.590 -11683
## + Transit           1    0.0403 21.600 -11682
## + Drive             1    0.0332 21.607 -11681
## <none>                          21.641 -11679
## + VotingAgeCitizen  1    0.0105 21.630 -11678
## + FamilyWork        1    0.0026 21.638 -11677
## + Poverty           1    0.0026 21.638 -11677
## + ChildPoverty      1    0.0000 21.641 -11677
## - MeanCommute       1    0.0658 21.706 -11674
## - WorkAtHome        1    0.1054 21.746 -11669
## - Men               1    0.1701 21.811 -11662
## - Unemployment      1    0.3280 21.969 -11644
## - SelfEmployed      1    0.6460 22.287 -11608
## - Income            1    0.7028 22.343 -11602
## - Production        1    1.0796 22.720 -11561
## - Minority          1    1.3521 22.993 -11531
## - PrivateWork       1    2.7455 24.386 -11386
## - Office            1    3.2968 24.937 -11331
## - Service           1    3.4941 25.135 -11311
## - Professional      1    4.1147 25.755 -11251
## - TotalPop          1    4.7559 26.396 -11190
## 
## Step:  AIC=-11683.76
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service + Minority + Production + Income + 
##     Unemployment + Men + WorkAtHome + MeanCommute + Employed
## 
##                    Df Sum of Sq    RSS    AIC
## + Transit           1    0.0527 21.530 -11688
## + OtherTransp       1    0.0509 21.532 -11688
## + Carpool           1    0.0500 21.532 -11688
## + Drive             1    0.0447 21.538 -11687
## <none>                          21.582 -11684
## + VotingAgeCitizen  1    0.0114 21.571 -11683
## + Poverty           1    0.0087 21.574 -11683
## + ChildPoverty      1    0.0033 21.579 -11682
## + FamilyWork        1    0.0015 21.581 -11682
## - Employed          1    0.0581 21.641 -11679
## - Men               1    0.0854 21.668 -11676
## - WorkAtHome        1    0.0904 21.673 -11675
## - MeanCommute       1    0.1031 21.686 -11674
## - Income            1    0.3579 21.940 -11645
## - Unemployment      1    0.3861 21.969 -11642
## - SelfEmployed      1    0.6777 22.260 -11609
## - Production        1    1.0364 22.619 -11570
## - Minority          1    1.3812 22.964 -11532
## - PrivateWork       1    2.4465 24.029 -11420
## - Office            1    3.3305 24.913 -11331
## - Service           1    3.4273 25.010 -11322
## - Professional      1    3.9769 25.559 -11268
## - TotalPop          1    4.7026 26.285 -11199
## 
## Step:  AIC=-11687.81
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service + Minority + Production + Income + 
##     Unemployment + Men + WorkAtHome + MeanCommute + Employed + 
##     Transit
## 
##                    Df Sum of Sq    RSS    AIC
## + OtherTransp       1    0.0688 21.461 -11694
## + Carpool           1    0.0434 21.486 -11691
## <none>                          21.530 -11688
## + Poverty           1    0.0144 21.515 -11688
## + VotingAgeCitizen  1    0.0141 21.516 -11687
## + Drive             1    0.0061 21.524 -11686
## + ChildPoverty      1    0.0055 21.524 -11686
## + FamilyWork        1    0.0018 21.528 -11686
## - Transit           1    0.0527 21.582 -11684
## - Employed          1    0.0705 21.600 -11682
## - Men               1    0.0819 21.612 -11680
## - WorkAtHome        1    0.0849 21.615 -11680
## - MeanCommute       1    0.1297 21.659 -11675
## - Income            1    0.3542 21.884 -11650
## - Unemployment      1    0.3943 21.924 -11645
## - SelfEmployed      1    0.6668 22.197 -11614
## - Production        1    1.0444 22.574 -11573
## - Minority          1    1.4281 22.958 -11531
## - PrivateWork       1    2.4681 23.998 -11422
## - Office            1    3.2349 24.765 -11344
## - Service           1    3.4800 25.010 -11320
## - Professional      1    4.0277 25.557 -11266
## - TotalPop          1    4.5778 26.108 -11213
## 
## Step:  AIC=-11693.72
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service + Minority + Production + Income + 
##     Unemployment + Men + WorkAtHome + MeanCommute + Employed + 
##     Transit + OtherTransp
## 
##                    Df Sum of Sq    RSS    AIC
## + Carpool           1    0.0427 21.418 -11697
## + Drive             1    0.0337 21.427 -11696
## <none>                          21.461 -11694
## + VotingAgeCitizen  1    0.0138 21.447 -11693
## + Poverty           1    0.0125 21.448 -11693
## + ChildPoverty      1    0.0063 21.455 -11692
## + FamilyWork        1    0.0022 21.459 -11692
## - OtherTransp       1    0.0688 21.530 -11688
## - WorkAtHome        1    0.0702 21.531 -11688
## - Transit           1    0.0706 21.532 -11688
## - Employed          1    0.0739 21.535 -11687
## - Men               1    0.0786 21.540 -11687
## - MeanCommute       1    0.1432 21.604 -11679
## - Income            1    0.3515 21.812 -11656
## - Unemployment      1    0.3856 21.846 -11652
## - SelfEmployed      1    0.6600 22.121 -11621
## - Production        1    1.0510 22.512 -11578
## - Minority          1    1.3981 22.859 -11540
## - PrivateWork       1    2.4713 23.932 -11426
## - Office            1    3.2221 24.683 -11350
## - Service           1    3.4074 24.868 -11332
## - Professional      1    4.0224 25.483 -11271
## - TotalPop          1    4.5819 26.043 -11218
## 
## Step:  AIC=-11696.64
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service + Minority + Production + Income + 
##     Unemployment + Men + WorkAtHome + MeanCommute + Employed + 
##     Transit + OtherTransp + Carpool
## 
##                    Df Sum of Sq    RSS    AIC
## + Drive             1    0.2957 21.123 -11729
## <none>                          21.418 -11697
## + Poverty           1    0.0109 21.407 -11696
## + ChildPoverty      1    0.0060 21.412 -11695
## + VotingAgeCitizen  1    0.0057 21.413 -11695
## + FamilyWork        1    0.0025 21.416 -11695
## - Carpool           1    0.0427 21.461 -11694
## - Transit           1    0.0630 21.481 -11691
## - OtherTransp       1    0.0681 21.486 -11691
## - Employed          1    0.0708 21.489 -11690
## - Men               1    0.0764 21.495 -11690
## - WorkAtHome        1    0.0795 21.498 -11690
## - MeanCommute       1    0.1349 21.553 -11683
## - Income            1    0.3432 21.761 -11659
## - Unemployment      1    0.3748 21.793 -11656
## - SelfEmployed      1    0.6677 22.086 -11623
## - Production        1    1.0880 22.506 -11576
## - Minority          1    1.3375 22.756 -11549
## - PrivateWork       1    2.4971 23.915 -11426
## - Office            1    3.2560 24.674 -11349
## - Service           1    3.4487 24.867 -11330
## - Professional      1    3.9958 25.414 -11276
## - TotalPop          1    4.5511 25.969 -11223
## 
## Step:  AIC=-11728.99
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service + Minority + Production + Income + 
##     Unemployment + Men + WorkAtHome + MeanCommute + Employed + 
##     Transit + OtherTransp + Carpool + Drive
## 
##                    Df Sum of Sq    RSS    AIC
## + Poverty           1    0.0237 21.099 -11730
## + FamilyWork        1    0.0198 21.103 -11729
## <none>                          21.123 -11729
## + ChildPoverty      1    0.0060 21.117 -11728
## + VotingAgeCitizen  1    0.0005 21.122 -11727
## - MeanCommute       1    0.0450 21.168 -11726
## - Men               1    0.0572 21.180 -11724
## - Employed          1    0.0925 21.215 -11720
## - Transit           1    0.0980 21.221 -11720
## - OtherTransp       1    0.2526 21.375 -11702
## - Income            1    0.2844 21.407 -11698
## - Drive             1    0.2957 21.418 -11697
## - Carpool           1    0.3047 21.427 -11696
## - Unemployment      1    0.3530 21.476 -11690
## - WorkAtHome        1    0.3541 21.477 -11690
## - SelfEmployed      1    0.6980 21.821 -11651
## - Production        1    0.9810 22.104 -11619
## - Minority          1    1.1466 22.269 -11600
## - PrivateWork       1    2.2734 23.396 -11478
## - Office            1    2.8941 24.017 -11414
## - Service           1    3.5667 24.689 -11346
## - Professional      1    3.9941 25.117 -11303
## - TotalPop          1    4.5084 25.631 -11253
## 
## Step:  AIC=-11729.77
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service + Minority + Production + Income + 
##     Unemployment + Men + WorkAtHome + MeanCommute + Employed + 
##     Transit + OtherTransp + Carpool + Drive + Poverty
## 
##                    Df Sum of Sq    RSS    AIC
## + FamilyWork        1    0.0198 21.079 -11730
## <none>                          21.099 -11730
## - Poverty           1    0.0237 21.123 -11729
## + ChildPoverty      1    0.0034 21.096 -11728
## + VotingAgeCitizen  1    0.0000 21.099 -11728
## - MeanCommute       1    0.0428 21.142 -11727
## - Men               1    0.0480 21.147 -11726
## - Transit           1    0.0993 21.198 -11720
## - Employed          1    0.1074 21.206 -11719
## - OtherTransp       1    0.2554 21.354 -11702
## - Income            1    0.2773 21.376 -11700
## - Unemployment      1    0.2910 21.390 -11698
## - Drive             1    0.3085 21.407 -11696
## - Carpool           1    0.3122 21.411 -11696
## - WorkAtHome        1    0.3628 21.462 -11690
## - SelfEmployed      1    0.6580 21.757 -11656
## - Minority          1    0.8777 21.977 -11631
## - Production        1    0.9429 22.042 -11624
## - PrivateWork       1    2.2883 23.387 -11477
## - Office            1    2.8872 23.986 -11415
## - Service           1    3.5520 24.651 -11347
## - Professional      1    3.7422 24.841 -11328
## - TotalPop          1    4.5271 25.626 -11251
## 
## Step:  AIC=-11730.09
## votes^0.06061 ~ TotalPop + SelfEmployed + Professional + Office + 
##     PrivateWork + Service + Minority + Production + Income + 
##     Unemployment + Men + WorkAtHome + MeanCommute + Employed + 
##     Transit + OtherTransp + Carpool + Drive + Poverty + FamilyWork
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          21.079 -11730
## - FamilyWork        1    0.0198 21.099 -11730
## - Poverty           1    0.0238 21.103 -11729
## + ChildPoverty      1    0.0039 21.075 -11729
## + VotingAgeCitizen  1    0.0001 21.079 -11728
## - MeanCommute       1    0.0435 21.123 -11727
## - Men               1    0.0472 21.126 -11727
## - Employed          1    0.1038 21.183 -11720
## - Transit           1    0.1092 21.188 -11719
## - OtherTransp       1    0.2672 21.346 -11701
## - Income            1    0.2837 21.363 -11699
## - Unemployment      1    0.2901 21.369 -11698
## - Drive             1    0.3259 21.405 -11694
## - Carpool           1    0.3281 21.407 -11694
## - WorkAtHome        1    0.3645 21.444 -11690
## - SelfEmployed      1    0.6639 21.743 -11656
## - Minority          1    0.8849 21.964 -11630
## - Production        1    0.9538 22.033 -11623
## - PrivateWork       1    2.3080 23.387 -11475
## - Office            1    2.8973 23.976 -11414
## - Service           1    3.5696 24.649 -11346
## - Professional      1    3.7610 24.840 -11326
## - TotalPop          1    4.5010 25.580 -11254
## 
## Call:
## lm(formula = votes^0.06061 ~ TotalPop + SelfEmployed + Professional + 
##     Office + PrivateWork + Service + Minority + Production + 
##     Income + Unemployment + Men + WorkAtHome + MeanCommute + 
##     Employed + Transit + OtherTransp + Carpool + Drive + Poverty + 
##     FamilyWork, data = votes.biden.tr)
## 
## Coefficients:
##  (Intercept)      TotalPop  SelfEmployed  Professional        Office  
##   -7.256e-01     1.885e-07    -7.566e-03     1.333e-02     1.509e-02  
##  PrivateWork       Service      Minority    Production        Income  
##    7.161e-03     1.557e-02     1.323e-03     6.954e-03     1.866e-06  
## Unemployment           Men    WorkAtHome   MeanCommute      Employed  
##    5.614e-03    -2.180e-03     9.177e-03     9.794e-04     2.007e-03  
##      Transit   OtherTransp       Carpool         Drive       Poverty  
##    4.535e-03     1.125e-02     6.932e-03     5.460e-03     1.002e-03  
##   FamilyWork  
##    7.527e-03

The step() function says that by the lowest AIC, the best linear model to predict Biden’s votes per county consists of the predictors TotalPop, SelfEmployed, Professional, Office, PrivateWork, Service, Minority, Production, Income, Unemployment, Men, WorkAtHome, MeanCommute, OtherTransp, Drive, Carpool, Employed, Transit, Poverty, and FamilyWork. We can compare this to our transformed linear regression model with all predictors, and see that we have removed most of the predictors that were marked as non-significant by the summary() function. We can now create our linear model.

set.seed(123)
#linear regression biden
trans2.lin.votes.biden = lm(formula = votes^0.06061 ~ TotalPop + SelfEmployed + Professional + 
    Office + PrivateWork + Service + Minority + Production + 
    Income + Unemployment + Men + WorkAtHome + MeanCommute + 
    OtherTransp + Drive + Carpool + Employed + Transit + Poverty + 
    FamilyWork, data = votes.biden.tr)

summary(trans2.lin.votes.biden)
## 
## Call:
## lm(formula = votes^0.06061 ~ TotalPop + SelfEmployed + Professional + 
##     Office + PrivateWork + Service + Minority + Production + 
##     Income + Unemployment + Men + WorkAtHome + MeanCommute + 
##     OtherTransp + Drive + Carpool + Employed + Transit + Poverty + 
##     FamilyWork, data = votes.biden.tr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.55539 -0.04710  0.00592  0.05593  0.29550 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.256e-01  1.250e-01  -5.806 7.23e-09 ***
## TotalPop      1.885e-07  8.240e-09  22.872  < 2e-16 ***
## SelfEmployed -7.566e-03  8.613e-04  -8.784  < 2e-16 ***
## Professional  1.333e-02  6.377e-04  20.908  < 2e-16 ***
## Office        1.509e-02  8.223e-04  18.351  < 2e-16 ***
## PrivateWork   7.161e-03  4.372e-04  16.379  < 2e-16 ***
## Service       1.557e-02  7.643e-04  20.369  < 2e-16 ***
## Minority      1.323e-03  1.305e-04  10.142  < 2e-16 ***
## Production    6.954e-03  6.604e-04  10.529  < 2e-16 ***
## Income        1.866e-06  3.249e-07   5.743 1.05e-08 ***
## Unemployment  5.614e-03  9.669e-04   5.806 7.21e-09 ***
## Men          -2.180e-03  9.306e-04  -2.343 0.019218 *  
## WorkAtHome    9.177e-03  1.410e-03   6.509 9.14e-11 ***
## MeanCommute   9.794e-04  4.354e-04   2.250 0.024556 *  
## OtherTransp   1.125e-02  2.019e-03   5.572 2.79e-08 ***
## Drive         5.460e-03  8.871e-04   6.155 8.76e-10 ***
## Carpool       6.932e-03  1.123e-03   6.175 7.71e-10 ***
## Employed      2.007e-03  5.777e-04   3.474 0.000521 ***
## Transit       4.535e-03  1.273e-03   3.563 0.000373 ***
## Poverty       1.002e-03  6.027e-04   1.662 0.096553 .  
## FamilyWork    7.527e-03  4.956e-03   1.519 0.128946    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09276 on 2450 degrees of freedom
## Multiple R-squared:  0.7522, Adjusted R-squared:  0.7501 
## F-statistic: 371.8 on 20 and 2450 DF,  p-value: < 2.2e-16

We can see that our F-statistic has increased, and both our \(R^2\) and adjusted \(R^2\) values have not changed. This confirms that this is a stronger model since we have a cleaner linear regression without affecting the \(R^2\). This supports our belief that this is the best linear regression model for predicting Biden’s votes per county. We still have a very large adjusted \(R^2\), large F-statistic, and small p-value, which further supports the credibility of this model.

The same process could be done to predict the number of votes Trump got per county as well. While this analysis is very useful, I would prefer the results from a classification model over those from a linear regression model. First of all, the task of cleaning up the data to get the best possible linear regression is very tedious, and any of the classification processes are not as time-consuming. Additionally, in my opinion the results we obtained from our classification analysis were much more useful than the results from the regression. I believe it is of more use to know who the outright winner of each county would be, as in presidential elections that is what mainly matters, since these county-wide winners are what determine which candidate receives the electoral votes allocated for that respective state. The popular vote is quite meaningless for most states in actually being able to call the overall election winner, so the classification task is more useful in that sense. However, that is not to say that the linear regression model of predicting the number of votes a candidate would get per county is useless. This sort of model would be extremely useful to complement a classification task in specifically trying to predict candidate votes in the counties of swing states, since the popular vote is much more meaningful in these states since they have a higher likelihood to request ballot recounts.

21. (Open ended) Interpret and discuss any overall insights gained in this analysis and possible explanations. Use any tools at your disposal to make your case: visualize errors on the map, discuss what does/doesn’t seems reasonable based on your understanding of these methods, propose possible directions (collecting additional data, domain knowledge, etc).

This analysis was extremely eye-opening and interesting as students who hope to pursue data science in the future. It was very interesting to learn about what different scenarios each type of model would thrive best in, and to see how different models would perform in regards to one question really helped put context to all that we have been learning this quarter. Seeing how 5 different models behaved with one type of question really helped contextualize all that we learned about why certain models are best to use in certain situations, as these reasons became very clear upon seeing the models either flourish or fail. We always see visualizations of trends of how groups of people behave in the election based on location, economic status, professional status, etc…, but seeing the concrete evidence of how each of those populations had such a big part to play was fascinating. This exercise re-affirmed the fact that the minority vote is so important, as that variable popped up over and over again in every machine learning model that we created as a part of this project.

It was also interesting to learn different data visualization techniques, and this is definitely a useful skill for the future. Being able to see, with different colors and types of charts, what the data that we have been analyzing means really helped cement the conclusions about how boosting was the best moel to use, and why kNN was so poor in this context. It was also interesting to use the clustering to view how Santa Barbara County was classified in a cluster, as it really had the project hit home and re-enforced that our voices are probably represented as a part of the Santa Barbara data. However, it would have also been interesting to have a better understanding as to the differences from counties in different states, especially counties of swing states.

What we truly understood as reasonable and crucial for our modeling was the data cleansing step for collinearity, as when census.clean was being manipulated it was to create a certain demographic that pertains to specific individuals. This step was crucial to make the data as clean as possible, as collinearity could skew the data greatly and definitely cause higher error rates than is true.

Collecting additional data would had made a difference to our knowledge. Given the state of the country right now with COVID-19, these results with 2017 data are probably not as close as they would have been if 2020 was not a year with a global pandemic. The true 2020 unemployment and poverty percentages are probably signficantly higher, as well as the WorkAtHome percentages. This would have created very different visuals and models.

Overall we have a much better understanding about the census dataset and the election dataset, as it indicates clearly as to Joe Bidens win along with the demographics of the individuals who votes from. Knowing the type of indivudual that voted for Donald Trump or Joe Biden is very interesting to see as to how each individual has different opinions but were seeing it number wise. This class overall has provided us with us many insights as to use for the future, and the methods used in this course are greatly appreciated and will definitely be used by us in our future careers as data scientists..